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Abstract 

Mapping expression quantitative trait loci (eQTLs) has been shown as a powerful tool to uncover the 
genetic underpinnings of many complex traits at the molecular level. In this paper, we present an in- 
tegrative analysis approach that leverages eQTL data collected from multiple population groups. In 
particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistently 
presented across populations, accounting for population heterogeneity in allele frequencies and linkage 
disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework en- 
ables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the 
GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded 
that i) jointly analysis across population groups greatly improves the power of eQTL discovery and the 
resolution of fine mapping of causal eQTL. ii) many genes harbor multiple independent eQTLs in their cis 
regions iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs 
(p-value = 4.93 x 1(T 22 ). 

Author Summary 

Expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression pheno- 
types. Mapping eQTLs enables studying genetic basis of gene expression variation. In this study, we 
introduce a statistical framework to analyze genotype-expression data collected from multiple population 
groups. We show that our approach is particularly effective in identifying potentially multiple indepen- 
dent eQTL signals that are consistently presented across populations in the proximity of a gene. In 
addition, our analysis framework allows effective integration of genomic annotations into eQTL analysis, 
which is helpful to dissect the functional basis of eQTLs. 

1 Introduction 

Expression quantitative trait loci, or eQTLs, are genetic variants that are associated with gene expression 
levels. Mapping eQTLs can help in dissecting the molecular mechanisms by which genetic variants impact 
organismal phenotypes. Recent studies ( have revealed that there are substantial overlaps between 

eQTLs and genetic variants identified from genome-wide association studies (GWAS) of disease pheno- 
types. In addition, eQTL mapping provides a powerful tool for investigating the regulatory machinery 
in different tissues ( [4|[5]) or cellular environments ( [6jj8]). 

In this paper, we jointly address three outstanding issues in eQTL mapping. First, due to the high 
experimental cost, most available eQTL data sets typically have limited sample size. To improve power 
of eQTL discovery, it becomes necessary to aggregate evidence across multiple data sets. Second, because 
a gene is typically regulated by many regulatory elements, it is highly likely that there exist multiple 
independent eQTLs in its proximity (i.e., cis region). In this scenario, a multiple SNP analysis is required 
to uncover all relevant cis acting genetic factors involved in the gene regulation process ( [9]). Third, 
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the availability of extensive functional annotations ( |10| - |12| ) now enables integration of functional ge- 
nomic information into eQTL analysis, which can be useful to dissect the functional basis of eQTLs . 
Linking genomic annotations to eQTLs goes beyond genetic association analysis, and helps gain a better 
understanding of the underlying biological processes. Individually, some of these three issues have been 
discussed by previous works. For example, |3, 9j jT3}fl6] discussed single SNP analysis of eQTLs jointly 
from different studies, populations or tissues. But these methods do not naturally extended to multiple 



SNP analysis. 17-20 examined the enrichment of selected genomic features in cis-eQTLs, mostly based 



on single SNP association results. To the best of our knowledge, there is no existing approach that jointly 
addresses all three issues in a systematic way. 

In this paper, we demonstrate an integrative analysis approach to perform fine mapping and functional 
study of eQTLs using cross-population samples. Our statistical methods are extended from a Bayesian 
framework proposed by |14||15[|2l] , which has been successfully applied in mapping eQTLs from multiple 
tissues. We apply our statistical framework to analyze the data from the GEUVADIS project ( [20]), 
where the expression-genotype data are collected from five population groups. In GWAS, trans-ethnic 
meta-analysis of genetic association data from diverse populations has been shown to be a powerful 
tool in detecting novel complex trait loci and improving resolution of fine mapping of causal variants by 
leveraging population heterogeneity in local patterns of linkage disequilibrium (LD) and allele frequencies 
( [22j[23]). This approach, to the extent of our knowledge, has not been applied to eQTL analysis. 
Utilizing cross-population expression-genotype data, we are interested in identifying eQTL signals that 
are consistently presented in all populations. Furthermore, we aim to examine whether we have sufficient 
statistical power to identify multiple independent czs-eQTL signals with the available aggregated sample 
size. Last but not least, we set out to investigate the role of genetic variants that disrupt transcription 
factor (TF) binding in transcriptional processes. More specifically, we ask the question if such variants 
more likely to be eQTLs. The three of our main aims are also inter-related. With higher power through 
sample aggregation in mapping eQTLs, we expect to identify potentially multiple genomic regions that 
harbor casual eQTLs at a high resolution. And consequently, we anticipate that these efforts improve 
the statistical power and precision of localization for our functional analysis. 



2 Results 



2.1 Method Overview 

We start by a brief description of our statistical procedure and general strategy for multiple SNP fine 
mapping analysis in a meta-analytic setting across multiple populations. 



2.1.1 Statistical Model and Inference 

Consider a genomic region with p SNPs that is interrogated in s different population groups. In each 
group i, we use a multiple linear regression model to describe the potential genetic associations between 
the p SNPs and the expression levels of a target gene: 

p 

y i =/iil + ^y9 i ,ifli,j+e i) e 4 - N(0, afl), i = l,...,s, (1) 

3=1 

where the vectors y i and represent the expression levels and the residual errors in population group 
i, the parameters fii and o~f denote the population group specific intercept and residual error variance, 
respectively. The vector g i j denotes the genotype of SNP j in population group i, and the regression 
coefficient fiij represents its genetic effect. Across all population groups, the s linear models form a 
system of simultaneous linear regressions (SSLR, 15 1). The problem of mapping eQTLs can be framed 
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as identifying SNPs with non-zero (3ij values based on ([T]). For each SNP j, we define an indicator vector 

lj ■= -^0),...,1(/3 SJ ^0)) 

representing its association status in each of the s population groups. Such indicator is referred to as a 



"configuration" in the literature of genetic association analysis across multiple subgroups ( [14 15 24 ). 
For each target gene, our computational procedure is designed to make joint inference with respect to 
r := {7 l5 . . . , 7 p } given observed genotype data, G :— {g 1 1; . . . , g l p , . . . , g s p }, and phenotype data, 
Y :={y v . ..,y a }. 

For mapping eQTLs in a meta-analytic setting, we make an important assumption that if a SNP is 
genuinely associated with a given expression phenotype, its underlying genetic effects are non-zero in all 
population groups. That is, 7- = 1, if SNP j is an eQTL, and 0 otherwise. This assumption is largely 
motivated by the biological hypothesis that the regulatory mechanisms behind eQTLs should remain the 
same across populations. 

For each SNP j, we assume an independent prior for 7 ., which assigns most of the probability mass 
on 7 ■ = 0 and encourages an overall sparse structure of T. This is because most previous studies only 
identify small numbers of cis-eQTLs for any given gene. Particularly in our fine mapping analysis where 
p is typically in the magnitude of 10 3 to 10 4 , we use 

p r(7, = l) = ^ (2) 
which implies that we expect a single cis-eQTL signal for the target gene a priori, because 

E E 1(7,^0) 

We will further justify this prior specification based on our overall strategy for fine mapping analysis in 
section [2. 1.3[ and discuss some other alternative specifications for enrichment analysis in scction |2.1.4| 

Conditional on SNP j being an eQTL, we model its genetic effects across populations, /3 ■ :— (/?i j, . . . , f3 s ,j) 
using a flexible Bayesian prior for meta-analysis proposed in [21] . Briefly, we assume that the effect sizes 
of an eQTL are correlated while allowing reasonable degree of heterogeneity across population groups 
(See Method section for details). 

Given observed data (Y, G) and a specified T value, a Bayes factor 

P(Y\G,T) 



P(Y |G,r = 0)' 



can be analytically approximated with high precision following [15 j . Based on this result, we compute 
the posterior of T using the Bayes rule, i.e., 



Pr(T I Y, G) oc Pr(T) P(Y \ T, G) 

n pr (7i)l BF(r). 



oc 



(3) 



We implement an efficient MCMC algorithm to perform the posterior inference and summarize the fine 
mapping results from the posterior samples. More specifically, we compute the posterior probability for 
each possible T by the corresponding frequency in the posterior samples. We will refer to this quantity 
as "posterior model probability" henceforth. To evaluate the importance of each SNP, we compute a 
posterior inclusion probability (PIP) for each SNP by marginalizing over all posterior model probabilities. 
If a SNP is included in posterior models with high frequencies, the corresponding PIP tends to be large. 
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2.1.2 Dealing with Genetic Data from Multiple Populations 

We note that analyzing eQTL data using samples collected from multiple populations is slightly different 
from the standard settings in meta-analysis of genetic associations. In particular, the allele frequencies 
of interrogated SNPs and/or patterns of linkage disequilibrium presented in genotype data can be highly 
heterogeneous in different populations. The proposed multiple SNP fine mapping procedure takes ad- 
vantage of the unique setting of cross-population samples, and leverages the statistical power for eQTL 
discovery. 

The allele frequency of a SNP is unrelated to its underlying genetic effect with respect to expression 
levels. However, because of its direct impact on sampling errors, it affects the precision of estimated 
fiij (denoted by Pij) in model (JlJ. Lower allele frequency typically results in larger uncertainty (e.g., 
the standard error of fiij of a rare SNP is usually larger than that of a common SNP), which implies 
that rare SNPs tend to be less informative. In the extreme case if a SNP is monomorphic in samples 
from a particular population, it should be considered completely uninformative for examining genetic 



associations. As shown by 15 , Bayesian procedures, especially in use of Bayes factors, can precisely 
capture the (non-)informativeness of SNPs in each population group: e.g., including the genetic data 
of monomorphic SNPs yields identical inference results as discarding such SNPs from corresponding 
population groups. On the other hand, our adopted Bayesian meta-analysis prior enables aggregating 
less informative or weak association signals from low frequency SNPs, as long as they are consistent across 
population groups. 

In pursuing consistent association signals, our fine mapping procedure takes advantage of potential 
varying LD patterns across multiple population groups. This is mainly because our method favors 
identifying eQTLs whose effects are consistent in all population groups, whereas SNPs that tag causal 
variants only in some populations (due to population-specific LD structures) are automatically down- 
weighted. As a consequence, the genomic regions that harbor causal eQTLs can be effectively narrowed 
down using cross-population data. This advantage becomes even more obvious when performing multiple 
SNP analysis, as all candidate czs-SNPs are simultaneously evaluated. 

2.1.3 Gene-level Testing Prior to Fine Mapping cis-eQTLs 

In most of currently available data sets, eQTL discoveries are made only in a proportion of genes. To 
reduce the computational cost of the fine mapping analysis, we adopt a practical procedure that first 
screens the genes having at least one cis-eQTL (which we will call as "eGenes"). 



The statistical procedure to identify eGenes across multiple subgroups has been well established in 14 
More specifically in the context of cross-population eQTL mapping, it assumes a similar linear model 
system as ([l]), and performs gene-level Bayesian hypothesis testing. Namely, for each gene, we test 

H 0 : 7i = ' ' ' = 7 P = 0 

versus 

Hi : some jj — 1. 

The eGenes are identified upon rejections of corresponding null hypothese, and we only follow up the 
eGenes with the multiple czs-eQTL analysis. 

The gene- level testing is computationally efficient, it effectively filters out a substantial set of genes 
that are unlikely to be interesting for fine mapping. Furthermore, this additional procedure justifies our 
use of prior ([2]): given that a gene is identified as eGene, expecting a single cis-eQTL prior to the fine 
mapping analysis seems, in average, a slightly conservative assumption. 
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2.1.4 Integration of Genomic Feature and Enrichment Analysis 

In the enrichment analysis, we are interested in identifying some common properties shared by eQTL 
SNPs genome-wide. Intuitively, in contrast to the fine mapping procedure where each gene is separately 
processed, the enrichment analysis requires jointly analyzing all gene-SNP pairs. 

Our Bayesian framework provides a natural way to integrate genomic features of SNPs into the 
association analysis. In particular, we use the prior specification of Pr( 7 ■) to incorporate the genomic 
annotations of SNP j with respect to the target gene. More specifically, we assume a general logistic 
model, 

"Pr( 7j = 1 



log 



Pr( 7 = 0 



za Q + ^ a khj, (4) 
fc 

where 6k j denotes the fc-th annotation for SNP j. The regression coefficient «fc represents the strength 
of association between the k-th genomic annotation and the prior probability of a SNP being an eQTL, 
and is assumed to be invariant across all gene-SNP pairs. For a binary annotation, a positive a& implies 
that an annotated SNP has higher odds to be an eQTL, or equivalently speaking, the annotated SNPs 
are enriched in eQTLs. For the remaining of this paper, we will refer to ctk's as enrichment parameters. 

The inference procedure of the Bayesian model with prior specification Q is conceptually straight- 
forward, and we give the details in Method section. However, because the estimation of the enrichment 
parameters requires pooling information across all genes, the computational cost is substantially higher 
comparing to the gene by gene fine mapping analysis. 

To ease the computational burden, we derive an approximate inference procedure focusing on hypoth- 
esis testing of a^'s. This procedure first performs separate multiple czs-eQTL mapping for every gene 
interrogated (i.e., not just eGenes), using a special case of the prior model Q, i.e., 



log 



Pr( 7 , = 1) 



Pr( 7 , = 0) 



a 0 . 



In particular, we determine cto m a similarly conservative way as ([2]), using the results from gene-level 
testing. Let g e and g s denote the number of eGenes identified and the number of total gene-SNP pairs 
in the analysis, respectively. We set 

Pr( 7i = 1) = k = ^, (5) 

9s 

which, in a way, assumes that each eGene contains only a single associated gene-SNP pair, and corresponds 
to ao = log (jz^j ■ We then fit a logistic regression model by correlating the resulting PIP of each gene- 
SNP pair with the genomic annotations of the corresponding SNP. The fitted regression coefficient of each 
annotation is then regarded as the corresponding otk estimate. This approximate inference procedure 
seemingly resembles some of the post hoc enrichment analysis approaches that are commonly applied 
( [7||9|[20]). We provide a statistical justification of this procedure in Method section. 



2.2 Analysis of GEUVADIS Data 

In this paper, we focused on analyzing the expression and genotype data collected from the GEUVADIS 
project ( [20]). More specifically, the data set consists of RNA-seq data on lymphoblastoid cell line 
(LCL) samples from five populations: the Yoruba (YRI), CEPH (CEU), Toscani (TSI), British (GBR) 
and Finns (FIN). In our analysis, we selected 420 samples which were densely genotyped in the 1000 
Genomes Phase I data release ( [25]) and 11,838 protein coding genes and lincRNAs that are deemed 
expressed in all five population groups. Throughout, our analysis focused on the SNPs that locate within 
a 200kb genomic region centered at the transcription start site (TSS) of each gene. In contrast to the 



original eQTL mapping discussed in 20 , we treated each population as a single group and performed 



cis-eQTL analysis jointly across all five groups. 
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2.2.1 Power Gain in gene-level Meta-Analysis 

We started our analysis of the GEUVADIS data by performing gene-level testing jointly across all five 
population groups. 

In total, we identified 6,555 eGenes from 11,838 tested protein coding and lincRNA genes at 5% FDR 
level. For comparison, the numbers of eGenes identified using each population data alone are given in 
Table[T] The separate analysis identified no more than 2,100 eGenes in any of the population groups. The 
union of the eGenes from the separate analysis yielded 3,447 genes, the vast majority of which (except for 
60 genes) were included in the discoveries by the meta-analysis. In addition, the meta-analysis identified 
a total of 3,168 new eGenes. 

Examining the set of eGenes uniquely identified in the meta-analysis, we found that they share 
the following common feature: when performing separate analysis within each population group, the 
strongest association in each respective c«s-region only shows modest strength and does not pass the 
required significance threshold; However, across population groups, the same association signal tends 
to be highly consistent, and the overall evidence aggregated across population groups becomes quite 
strong. As a consequence, the joint analysis of all population groups is able to detect such signals. We 
demonstrate one of such examples in Figure [I] using gene NME1 (ensemble ID: ENSG00000239672), 
where the gene-level Bayes factor in the joint analysis is several order of magnitude higher than the 
corresponding gene-level Bayes factors in each separate population group analysis. 



2.2.2 Multiple SNP analysis of eGenes 

We followed up the gene-level analysis by performing multi-SNP fine mapping for the set of identified 
eGenes across all 5 population groups. 

One of our primary aims was to identify potential multiple independent cis-eQTL signals while ac- 
counting for varying LD patterns in different populations. First, we asked how often we can identify 
multiple cis-eQTL signals in eGenes. To this end, for each fine mapped eGene, we computed the ex- 
pected number of independent cis-eQTL signals from the corresponding posterior distributions (Method 
section). Figure [2] shows the histogram of posterior expected cis-eQTL signals in all 6,555 eGenes. It is 
clear that for the available (accumulated) sample size in the GEUVADIS data, we identified only single 
cQTL signals for the majority of the eGenes. Nevertheless, for a non-trivial proportion of genes, there 
are strong evidence that multiple cis-acting regulatory genetic variants co-exist and can be confidently 
identified by our fine mapping procedure. More specifically, there are about 14% of the eGenes (or 7% 
of all interrogated genes) with the posterior expected number of cis-eQTLs > 2. 

In the case of gene LHPP (Eensemble ID: ENSG00000107902), four independent eQTL signals were 
confidently identified (Figure [3]). Each eQTL signal (except one) is represented by a cluster of highly 
correlated SNPs in a small genomic region. Due to LD, within each cluster the correlated SNPs tend to 
have similar PIPs and we could not be certain which SNP is truly driving the association signal. However, 
the sum of the PIPs within each cluster is very close to 1, indicating almost certainty that an eQTL is 
located within the region. There were 134 different posterior models examined for gene LHPP in the 
sampling phase of the MCMC run. Interestingly, every model contains exactly 4 SNPs (which results in 
posterior expected number of cis-eQTLs being 4 with variance 0), each from one of the four independent 
clusters. 

As discussed in section 2.1. 2[ even in the cases when only a single cis-eQTL signal can be identified, we 



observed that the fine mapping procedure takes advantage of varying LD patterns across populations and 
narrows down the set of candidate causal variants by down- weighting SNPs that tag causal variants only in 
some populations. For example, in analyzing the data of gene AG03 (Ensemble ID: ENSG00000126070) 
from TSI alone, we identified a strong single eQTL signal within a 144kb region. The region contains 
41 SNPs that are in perfect LD and show equal strengths of associations. In other four populations, 
this particular region is broken into smaller LD blocks and the associations for the 41 SNPs become 
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distinguishable. As a result, when performing multiple-SNP analysis across all populations, we narrowed 
down the potential causal eQTL into a 1.2kb region, with only 3 candidate SNPs (in high LD in all 
populations) fully explaining the observed association. 

To further quantify the effect of LD filtering, we selected a set of 526 eGenes that highly likely harbor 
exactly one ds-eQTL based on our multiple SNP analysis. More specifically, we selected the genes whose 
posterior expected number of czs-eQTLs = 1 and variance < lx 10 -4 . We then ran multiple cis-eQTL 
analysis separately in each of the five populations for each selected gene. For every multiple SNP analysis 
of each gene, including the meta-analysis, we constructed a 95% credible set that contains the minimum 
number of SNPs with the sum of PIPs > 0.95, and defined its credible region length as the distance 
between the the right-most and left-most SNPs in the set. In 486 out of 526 (or 92%) genes, we found 
that the cross-population meta-analysis yields the smallest credible region length. The median ratio of 
credible region length by the meta-analysis to the minimum credible region length by the corresponding 
separate analyses is 0.50 across all 526 genes, indicating that, in average, cross-population meta-analysis 
can effectively narrow down the genomic regions that harbor cis-eQTLs. 

Multi-SNP analysis can also be very helpful in explaining some of the extreme heterogeneity of eQTL 
effects across populations observed in single SNP analysis. In particular, we identified a few SNPs that, 
when analyzed alone, appear to show strong but opposite genetic effects on expression levels in different 
populations. It seemingly suggests that a particular allele of the variant increases expression levels 
of the target gene in one population and decreases expression levels in another population. However, 
going through all such individual examples, we found none of the opposite effect association patterns 
is supported by our multiple SNP analysis. In mapping cis-eQTLs for gene TTC38 (Ensemble ID: 
ENSG00000075234), we found a set of tightly linked SNPs displaying opposite directional effects in YRI 
and the European populations in single SNP analysis. For example, the A allele of SNP rs6008600 
shows consistently strong negative effect in the four European populations, whereas in YRI the same 
allele displays a highly significant positive effect (Figure [5]) . The multiple SNP fine mapping offered a 
compelling explanation for this phenomenon: two independent cis-eQTL signals were identified in the 
nearby regions, and interestingly, SNP rs6008600 tags one signal in YRI (r 2 sa 0.73), whereas in the 
European populations, it is highly correlated (e.g., in CEU r 2 1) with the other signal (Figure pi). 

Although our multiple cis-eQTL mapping method confidently identified independent association sig- 
nals accounting for LD, in most of the examples we have examined, it usually could not pinpoint a single 
causal SNP by fully resolving LD relying only on the association data. This is because highly corre- 
lated SNP genotypes are nearly "interchangeable" in our statistical association models, and therefore not 
identifiable. As a consequence, there are many combinations of SNPs showing equivalence or near equiv- 
alence based on observed data. Reporting a single "best" model while ignoring its intrinsic uncertainty 
can be highly problematic. In the previously mentioned example of gene LHPP, we found that a large 
proportion of the 134 reported posterior models by the MCMC algorithm exhibit very similar likelihood 
and posterior probabilities, and the maximum posterior model probability is only 0.02. Our Bayesian 
fine mapping approach summarizes the measure of uncertainties using the posterior probabilities both at 
model and SNP levels, which can be naturally carried over to potential downstream functional analysis. 



2.2.3 Functional analysis of eQTLs 



We applied the approximate inference procedure described in section 2.1.4 to perform enrichment analysis 
using the GEUVADIS data. In particular, we conducted the multiple czs-eQTL analysis for all 11,858 
genes using the prior (|5j). Notwithstanding the conceptual difference between priors Q and the 
overall results of multiple cis-eQTL analysis for identified eGenes are markedly similar. Based on these 
result, we set out to investigate the relationships between eQTLs and some functional genomic features. 

We started with examining the enrichment of eQTL signals with respect to their distances to the 
TSS of respective target genes. More specifically, we computed the posterior expected numbers of eQTL 
signals within the non-overlapping 1Kb windows defined by their distances to TSS by summing over the 
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PIPs of all cis-SNPs falling in each respective window. The results are summarized in Figure [6j It is clear 
that cis-eQTLs tend to cluster around TSS, and the decay of eQTL signals away from TSS is fast. In 
particular, Figure [6] suggests that 50% cis-eQTL signals are concentrated within 20kb of TSS. Although 
this phenomenon is well known ( [5 18), our results display a much cleaner pattern comparing to the 



previous reports. This is most likely because we considered multiple cis-eQTL signals for a target gene, 
and our use of PIP accounted for the uncertainty of eQTL calls. 

We next asked whether eQTLs are enriched for genetic variants disrupting TF binding. Answer- 
ing this question helps reveal the underlying regulatory logic between transcription factor binding and 
gene expression ( [l7|[lSj[26j[27j). To this end, we used the base-pair resolution quantitative annotation 



of binding variants from the extended CENTIPEDE model ( [ll][28j). In brief, this annotation was 
generated starting from DNase-seq data (for LCLs in the ENCODE project) and seed position weight 
matrices (PWM) models from TRANSFAC (http://www.gene-regulation.com/pub/databases.litml) 
and JASPAR (http://jaspar.genereg.net/) databases describing the sequences that the TFs prefer 
to bind. CENTIPEDE was then used to fit a mixture model that learns for each TF motif: 

1. a characteristic footprint shape that evidences active binding 

2. a re-calibrated sequence PWM model that is useful to predict the the impact of a genetic variant 
may have on binding 

Each genetic variant from the 1000 Genomes project was annotated as one of the following three mutually 
exclusive categories: 

• SNPs that are not located in any DNasel footprint region, or in brief, baseline SNPs 

• SNPs that are in a footprint region but predicted to have little or no impact on TF binding, or in 
brief, footprint SNPs 

• SNPs that are in a footprint region and predicted to strongly affect TF binding, or in brief, binding 
variants 

About 4% of the total 6.7 million interrogated cis-SNPs are annotated as binding variants, and another 
4% are annotated as footprint SNPs. Additionally, we controlled for SNP positions with respect to TSS 
in the model. Our main findings from this analysis are: 

• binding variants are 1.49 fold (with 95% confidence interval [1.38, 1.63] ) more likely than baseline 
SNPs to be eQTLs, its enrichment in eQTLs is statistically highly significant (p- value = 4.93 xlO -22 ) 

• footprint SNPs are 1.15 fold (with 95% confidence interval [1.04, 1.27] ) enriched in eQTLs, com- 
paring to baseline SNPs, with the corresponding p-value = 0.0035 

Our results implies that the sequence variants that potentially disrupt TF binding are more likely to have 
a functional impact on gene expression. In comparison, footprint SNPs (those predicted to not affect 
binding) are less likely to affect expression and this is reflected with a relatively lower level of enrichment 
for eQTLs. 

Given the results of the enrichment analysis, the corresponding annotation information can be quanti- 
tatively incorporated into the fine mapping analysis. More specifically, we plugged in the point estimates 
of the enrichment parameters to re-compute the Pr("yy) using the logistic model Q for each gene-SNP 
pair, and re-ran the MCMC algorithm with the updated priors. As a result, the priors for binding vari- 
ants were up-weighted comparing to the nearby baseline and footprint SNPs. Because the computation 
of likelihood function was intact, the PIPs for binding variants were accordingly up-weighted. Using this 
approach, it becomes plausible to quantitatively distinguish SNPs in perfect LD but belonging to differ- 
ent annotation categories. Take Gene LY86 (Ensemble ID: ENSG00000112799) as an example (Figure 
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[7]). Before incorporating any annotation information, each cis-SNP was equally weighted a priori and 
the multiple czs-eQTL analysis identified three independent signals, two of which were represented by 
clusters of highly correlated SNPs. Utilizing the quantitative annotation information, the fine mapping 
procedure still confidently identified the three original eQTL signals, however two binding variants be- 
came the top associated SNPs (ranked according to PIPs) in each respective cluster of SNPs that were 
indistinguishable in the original analysis (Figure [7]) . 

In the set of 526 genes that we confidently identified as harboring only one single cis-eQTL signal, 
when applying the equal prior without any annotation information, we found that that binding variants 
and footprint variants are top associated SNPs in 11% and 8% of the genes (or 60 and 43 genes in 
number), respectively. Using the priors based on quantitative binding annotation and distance to TSS, 
the percentages of genes with binding variants and footprint SNPs as top associated SNPs increase to 
16% and 11% (or 85 and 57 genes in number), respectively. 

3 Discussion 

In this paper, we have presented a systematic approach to jointly analyze eQTL data from cross- 
population samples. Our core statistical methods are built on the established Bayesian framework for 



association analysis of genetic data from heterogeneous groups 14 15 21 . For the first time, we have 
applied them for multiple SNP analysis in a cross-population meta-analytic setting: with a combined 
sample size ~ 400 in the GEUVADIS data, we have demonstrated that multiple independent cis-cQTL 
signals can indeed be effectively identified in many genes. 

The commonly applied strategy for mapping additional association signals is to conduct conditional 
analysis, which can be viewed as a step- wise variable selection algorithm. One of the notable disadvantages 
of conditional analysis is that it only reports a single "best" model in the end, and completely ignores 
its uncertainty. From many of our examples shown in this paper, it is clear that in most cases even we 
are relatively certain about the number of eQTL signals in a cis region, there is typically a great deal of 
uncertainty in determining the true causal SNPs. As a consequence, the posterior probability of the best 
model can be unimpressive, and it is potentially dangerous to only use the information from the "best" 
model in downstream functional analysis, if care is not taken. In comparison, our Bayesian analysis 
provides much more comprehensive information that fully conveys the uncertainty of the inference result, 
and the quantified uncertainty information is naturally propagated in our functional analysis. 

In this paper, we have employed a two-step procedure that first screens eGenes by performing gene- 
level hypothesis testing then carries out multiple SNP analysis for identified eGenes. This procedure 
is analogous to the fine mapping procedure that is commonly used in GWAS, where interesting loci 
are ranked and selected by single SNP association testing before an in-depth analysis focusing on each 
flagged high priority locus. We find this procedure not only yields considerable computational savings, 
but also provides a sound argument to justify our prior specifications for Pr('yy). Indeed, this procedure is 
completely general and can be extended to the fine mapping analysis where subgroups of eQTL data are 
formed by different tissues ( [4|[l4" ) or cellular conditions ( (6,8). Comparing to the meta-analytic setting 
we have encountered in this paper, the parameter space of {7^ } in those applications is more complicated 



(which includes 2 s potential values, where s is the number of subgroups/ tissues). Nevertheless, 14 has 
provided a principled way to "learn" the priors on possible values that / y,- can take by pooling information 
across genes through a hierarchical model. 

Our fine mapping results of eQTLs also demonstrate the benefit of utilizing cross-population samples 
in genetic association studies. Most importantly, the population heterogeneity of local LD patterns 
serves as an efficient filter that narrows down the regions harboring casual eQTLs. Nevertheless, varying 
LD patterns can cause some SNPs to display large degree of heterogeneity across populations in their 
estimated effect sizes from single SNP analysis: in the extreme cases, a SNP may appear to possess strong 
"population specific" effects. As we acknowledge that genuine population specific eQTLs are certainly 
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interesting phenomena and very much likely exist, we suggest interpreting highly heterogeneous eQTL 
signals from single SNP analysis with caution. It may be necessary to carry out multiple SNP analysis, 
as we have demonstrated in this paper, to simply rule out the possibility that the seemingly population 
specific effects are artifacts due to varying LD patterns. 

Our Bayesian inference framework naturally incorporates functional annotations in fine mapping 
cQTLs across population groups. This feature allows us to quantitatively evaluate the enrichment of 
certain functional feature in eQTLs, and in turn to use the quantified enrichment information to prioritize 
annotated SNPs for fine mapping analysis. Overall, our model for enrichment testing is similar to what 
were proposed in |17||18| . However, these previous approaches make simplifying assumption to restrict at 
most one cis-eQTL per gene, such that single SNP association results can be directly used. Our method 
relaxes this assumption and is fully integrated into our multiple SNP analysis procedure. In addition, 
our use of CENTIPEDE annotation to examine the relative enrichment of binding variants and footprint 
SNPs is also novel. Although it is largely expected that binding variants are enriched in eQTLs, it is 
important to note that the level of enrichment for footprint SNPs is much lower than that for binding 



variants. Interestingly, this finding seems concurring with the results reported by 28 where the relative 
enrichment of binding variants vs. footprint SNPs in other cellular and organismal phenotype QTLs is 
examined. 



4 Materials and Methods 



The computational methods used in the analysis are implemented in the software packages FM-eQTL 
(manuscript in prep) and eQTLBMA ( 14 ). They are freely available at|https : //github . com/timf lutr^/ 



eqtlbma and https://github.com/xqwen/fmeqtl 

The fine mapping results of all 6,555 identified eGenes, including scatter plots of PIPs, forest plots of 
top models, and detailed summaries from separate linear regression analysis are made available on the 
website http : //www-personal . umich . edu/~xwen/ geuvadis/ f m_rst_html/ 



4.1 Data Pre-processing 

The genotype and RNA-seq data were directly downloaded from the GEUVADIS project website. We 
selected 420 samples whose genotypes are directly measured in the 1000 Genomes project. The samples 
are evenly distributed in the five population groups, and the detailed breakdown of samples by population 
is shown in Table [TJ 

For the RNA-seq data, we used a slightly more stringent threshold than the original analysis to select 
genes that are expressed in all five populations. Specifically, for each selected gene, we required > 90% 
individuals in each population group to have RPKM > 0.1. From the 17,361 Ensemble genes that passed 
this filter, following the original analysis, we selected a subset of 11,838 genes consisting of annotated 
protein-coding genes and lincRNAs according to GENCODE ( [29]) release 17. We log transformed the 
RPKM values, and used the pipeline employed in [27pl| to remove the effect of GC content on expression 
measurements. We then followed the same strategy as described in [20] to remove latent confounding 
factors using the software package PEER ( [30]). However, unlike the original analysis, we ran PEER 
for each population group separately. In the end, we removed 15, 13, 15, 20 and 20 PEER factors for 
samples from YRI, CEU, TSI, GBR and FIN, respectively. Finally, the expression levels of each gene 
were quantile normalized across individuals separately in each population group. 

For genotype data, we filtered out SNPs whose sample allele frequencies < 0.03 in the overall samples 
across population groups. Note, we did not apply the allele frequency filter in each population group. 
The SNPs passing this filter must have sample allele frequencies > 0.03 in at least one population group. 
In general, as discussed in Method section, the rare SNPs do not impose any statistical or computational 
problems for our analysis. Nevertheless, removing SNPs that are not likely informative in any population 
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group helps improve computational efficiency. Following 17 31 , we defined a 200kb cis region for each 



gene centered at its TSS. In total, the final data set contains 6.7 million gene-SNP pairs. 



4.2 Gene-level Analysis of czs-eQTLs 

For each gene, we tested the null hypothesis which asserts no cis-cQTLs. More specifically, we adopted the 



Bayesian hypothesis testing procedure discussed in 14 . Essentially, 14 assumed a Bayesian model that is 



mostly similar to ([!]), except for an additional simplifying assumption, "at most one czs-eQTL per gene" 
Given a gene with p czs-sNPs, this additional assumption reduces the possible alternative scenarios into 
p single SNP association models, for which a gene-level Bayes factor can be easily computed analytically 
using Bayesian model averaging. In the context of eQTL mapping in multiple tissues, 14 considered all 



possible configurations, i.e., 7^ values, for each assumed associated SNP, whereas in our analysis we only 
allowed 7 ■ e {0, 1} for the reasons discussed in Results section. 

Briefly, for the alternative model where the j-th SNP is assumed the lone eQTL, the linear model (JTJ) 
is simplified to 



Vi = Mil + Aj9l 7 - + e i; e, - N(0, of I), i = 1, . 



(6) 



We model the correlation of genetic effects, ftj's across population groups through the following prior 
specification, 

ft^NG^ 2 ), 



^■~N(0,oj 2 ). 



(7) 



Equivalently, the above prior can be represented by a multivariate normal distribution by integrating out 
the average effect parameter /3j, i.e., 



N(0, W), 



(8) 



where the variance-covariance matrix W is given by 



/0 2 + u? u 2 
, ,2 



W = 



<j> 2 +uj 2 



\ 



The parameter 



j 2 characterizes the overall genetic effects of SNP j, and (f> 2 /(<p 2 +uj 2 ) represents the 



14 



15 



we considered the values of < 



degree of heterogeneity across population groups. Following 
from a set E = {(j> 2 + cj 2 : 0.1 2 , 0.2 2 , 0.4 2 , 0.8 2 , 1.6 2 } which covers a wide range of plausible magnitude 



of genetic effects. We allowed limited degree of heterogeneity by taking 0 /(0 



values from the 



set H = {(f> 2 /((j> 2 + uj 2 ) : 0, 0.1} which reflects our prior belief that effects of genuine genetic association 
should be highly consistent across population groups. Overall, we considered a combination of \E\ x \H\ 
grid for ((f> 2 ,Lu 2 ) values for each alternative model. Given this model, a single SNP Bayes factor BFj can 
be analytically evaluated following 



14 



15 , and the corresponding gene-level Bayes factor is obtained by 



Bayesian model averaging as 



BF 



gene 



E 



BF 



j ■ 



(9) 



Upon obtaining the gene-level Bayes factors, we used the methods implemented in the software pack- 
age eQTLBMA ( [H]) to select eGenes at FDR 5% levels. eQTLBMA implements two types of FDR 
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control procedures: one is a permutation based procedure which converts a gene-level Bayes factor to a 
corresponding p-value and control FDR using Storey's procedure; the other procedure is based on the 
EBF procedure discussed in [32 which directly works with gene-level Bayes factors and avoids any per- 
mutations. We found that the latter approach is much more computationally efficient, however slightly 
conservative. The results reported in Results section are based on the EBF procedure. 

4.3 Multiple SNP Analysis of cis-eQTLs 

In our multiple SNP analysis, we no longer assume "one cts-eQTL per gene", and consider the full range 
of alternative scenarios described by model (111) . To make joint inference with respect to T — {-y i5 . . . , ~f v }, 



we further specify effect size distribution, 

^1^ = 1,^,^ ~N(0,W), (10) 

where W is constructed in the same way as in the gene-level analysis. In particular, we used the same 
set of grid values of (4> 2 ,lu 2 ) in both analyses. Unconditional on 7^, the prior on /3 ■ is a type of "spike- 
and-slab" , where the "slab" is represented by a mixture of multivariate normal distribution, and the vast 
majority of the prior probability mass, 1 — ~, is assigned to the "spike" (i.e., a point mass at 0). 

The described linear model system including the prior specification (i.e., SSLR), is a special case of 
the general system considered by [15]. Given a specified T value, a Bayes factor, BF(r), contrasting to 
the trivial null model, T = 0, can be analytically approximated by applying the result discussed therein. 
With this result, the posterior probability, Pr(r | Y , G), can be computed up to an unknown normalizing 
constant, i.e., Pr(r|V,G) oc Pr(r)BF(r). We then implemented an Metropolis-Hastings algorithm, 
similar to what is discussed in [15] for multivariate linear regression model (MVLR), to efficiently traverse 
the space of 2 P possible T values. In particular, we implemented a novel proposal distribution that 
utilizes marginal and conditional analysis results to prioritize SNPs with strong marginal or conditional 
association signals. In practice, we observed that the resulting MCMC algorithm achieves fast mixing. 
The details of the algorithm is provided in the supplementary material. In the end, the MCMC algorithm 
yields a set of T samples from the posterior distribution, from which we computed the PIP for each SNP 
by marginalization. 

The posterior expected number of independent cis-eQTLs and its variance for each gene was obtained 
by computing the sample mean and variance of the number of non-zero 7,-'s in each posterior model. 
Equivalently, the posterior expected number of cis-eQTLs can be computed by the sum of PIPs, i.e., 



E 



Y,G 



£Pr( 7i =l|y,G). (11) 



For the fine mapping analysis of the GEUVADIS data, the MCMC algorithm was applied for each 
identified eGene individually. We carried out 25,000 burnin steps and 50,000 repeats for each MCMC 
run. Taking advantage of parallel processing, we performed multiple-SNP analysis for multiple genes 
simultaneously in a distributed computing environment, which greatly reduced the overall computational 
time. 

4.4 Enrichment Analysis of ds-eQTLs 

We specify the prior distribution of Pr(7 3 ) by the parametric function Q to incorporate genomic anno- 
tations into the eQTL mapping, whereas the other parts of the Bayesian model remains intact. For the 
g-th gene and its j-th SNP, we re-write Q using an equivalent vector form as follows 

logit [Pr( 7 f = 1)] = ct'6 a s , (12) 
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where the additional super-script for gene emphasizes the annotation data are specific to each gene-SNP 
pair, and the enrichment parameter a is assumed to be shared across all genes. 

Let D 9 denote the collection of the annotation data for gene g. For a total of q genes, it follows from 
the Bayes rule that 

Pr (a, T 1 , r 9 I {Y 1 ,G l ,D 1 } . . . {Y q , G q , D q }) cx P(a) [Pr(r ff | D 9 , a) BF(T 9 )} . (13) 

Given a prior specification P(a) (e.g. a flat prior), the quantities on the right hand side are individually 
straightforward to compute (analytically). It is conceptually easy to modify the MCMC algorithm out- 
lined above to jointly sample (a, r , . . . , T q ). Nevertheless, due to the extremely high dimensionality of 
the target space (which is approximately the number of gene-SNP pairs, in the case of the GEUVADIS 
data ~ 6.7 million), the convergence of the MCMC within a reasonable time frame may be in doubt. Fur- 
thermore, the computational resources, especially the memory usage, demanded by the MCMC algorithm 
may be too high to afford in a practical setting. 

Alternatively, we consider an EM algorithm that focuses on finding the MLE of a by treating 
(r ,...,T q ) as missing data. The derivation of the EM algorithm is mostly straightforward, we give 
the relevant details in the supplementary materials. Briefly at (t + l)-th iteration, in the Expectation 
step (E-step), we evaluate E [-y 9 \ Y 9 , G 9 , a^~\ for each gene-SNP pair given the current estimate a^. It 
should be noted that this conditional expectation is exactly the PIP of each gene-SNP pair which can be 
obtained by running the multiple-SNP analysis for each gene separately given the hyperparameter a'*' . In 
the Maximization step (M-step), we maximize a by fitting a logistic regression model relating PIP of each 
gene-SNP pair to the genomic annotations of the corresponding SNP. Overall, we describe the complete 
algorithm as an "MCMC-within-EM" algorithm, which is initiated at some arbitrary value, and 
iteratively performs multiple cis-eQTL mapping using the MCMC algorithm and maximization by fitting 
logistic regression models until convergence. The main computational benefit of the "MCMC-within- 
EM" algorithm is that the E-steps involving MCMC runs can be processed in parallel on a distributed 
computing system, hence the memory requirement is much relaxed. 

The approximate enrichment analysis procedure introduced in the Results section can be viewed as a 
special case of the MCMC-within-EM algorithm with a single iteration initiated at a' 0 - 1 = (ao, 0, . . . , 0). 
In particular, ao is computed by ([l]), and this starting point represents the global null model of no 
enrichment of any annotations in consideration. This one-step maximization approach is statistically 
valid: under the null model, the likelihood function is maximized at a^ 0 ' with respect to the enrichment 
parameters, whereas under the alternative, the corresponding parameters should be identified away from 
0 in the maximization step. Analogous to the score test, the approximate procedure is most powerful 
if the true alternative is close to the null model, i.e., the magnitude of the enrichment is relatively 
small . Nonetheless in general, the absolute values of the estimated enrichment parameters from this 
procedure represent a lower bound of the true absolute values of MLEs. The most attractive property 
of this approximate enrichment procedure is probably its computational cost, as it further simplifies the 
complete MCMC-within-EM procedure. 
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Supplementary Materials 

MCMC Algorithm for Mapping Multiple cis-eQTLs 

We implement a Metropolis-Hastings algorithm to perform posterior sampling based on equation (|3| in 
the main text. The algorithm is mostly straightforward, to help the Markov chain achieve fast mixing, we 
implement a novel proposal distribution based on the result of conditional analysis of multiple cis-eQTLs. 
We propose two simple "local" moves in the MCMC simulations: 

1. Change a jj value for SNP j 

2. Swap the values of 7^ and ■y k , for SNPs j and k 

where each SNP j is proposed according to a pre-calculated weight Wj . The novelty of the proposal distri- 
bution is that we construct the weights Wj's based on the conditional analysis results. More specifically, 
we start by computing Bayes factors for each cis-SNP in a single SNP analysis, and compute a quantity 

(i) = BF J (u) 

(Note that pj. is proportional to the PIP for SNP j assuming only one eQTL in the cis region and a 

uniform prior inclusion probability). We then find the SNP with the maximum pj 1 ' value, say SNP k. In 

(2) 

the next round, we control for the genotype of SNP k and repeat the single SNP analysis to obtain p^ , 
which mimics the conditional analysis of secondary cis-eQTL signals. Note that SNP k and the SNPs in 

LD will have single SNP Bayes factor close to 1 in this round. We again add the SNP with the maximum 

(2) 

Pj value into the control set. We repeat this procedure, with one additional SNP added into the control 
set in each round, until the maximum single SNP Bayes factor falls below a pre-defined threshold (we use 
10 in practice). Suppose the procedure ends in t iterations, we then compute the weight for each SNP 
using 

Wj = J2^ r) +6t+i-, (15) 

where 9\, 9t+i forms a decreasing geometric series summing up to 1. The trailing ^ term in the weight 
calculation represents a uniform distribution on candidate cis-SNPs. 



This particular proposal distribution is an extension of what is used in 33 , and should be credited 
to Matthew Stephens (personal communication). Its theoretical backend is related to sure-independence 
screening proposed by [34] in variable selection context. 



Maximum Likelihood Inference of Enrichment Parameters 

This section gives the technical details of MCMC-within-EM algorithm. Given the hierarchical model 
described in the main text, we are interested in performing maximum likelihood inference of enrichment 
parameter a. Treating {T 1 , ...T q } across all q genes as missing data, the complete data likelihood can be 
written as 

P({Y 9 }, {T 9 } I {G 9 }, {D 9 }, a) = J] P(T 9 \ D 9 , a) • f[ P(Y°\T°, G 9 ). (16) 

3=1 9=1 

We apply an EM algorithm to find the MLE of a. Because vector 7? only takes values in {0, 1}, using a 
loose notation, we represent vectors 0 and 1 with the corresponding binary scalar values. It then follows 
that 

P(T 9 \D 9 ,a)=Y[ 

j 



exp(a'8?) 
1 + exp(a'^^) 



7| 



-IT 



1 + exp(a'^^ 



(17) 
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The complete data log-likelihood is given by 

logi(a; {Y 9 }, {T 9 }, {G 9 }, {D 9 }) = £ ]T l] («'<5f ) "EE W + «p(a'^)] 

0 =ii=l 3=13=1 (lg) 

+ ^io g [p(^|r^G^] 

9=1 

The EM algorithm initiates by an arbitrary value of a, namely, a^K In the E-step of t-th iteration, we 
compute 

q Pg 



E[log L(a ; {Y 9 }, {T 9 }, {G 9 }, {D 9 }) \ {Y 9 }, {G 9 }, a«] = £ £ E ( 7 f | F 9 , G s , a«) (a'5j) 

9=1 3=1 

"EE + cx P(«'^)] +E E (log[^(^ 9 l T 9 , G 9 )} I Y 9 ,G 9 , a<«>) 



3=1 3=1 9=1 

Note that the last term does not contain parameter a. In the M-step of the t-th iteration, we find 

(q Pg q Pg \ 

EE E (^ 9 i y9 ' G9 ' a(!) ) («'<*i)-EE log [ 1+exp ( a/<5 P ( 2 °) 
3=1 3=1 9=1 3 = 1 / 



The objective function in (20 1 coincides with the log-likelihood function of a logsitic regression model 
treating each gene-SNP pair as an independent observation, however with the usual binary response vari- 
able replaced by the conditional expectations. By this connection, the maximization step can be carried 
out by fitting the corresponding modified logistic regression model treating conditional expectations as 
responses (i.e., via an iterative re- weighted least square algorithm). This also implies that in the E-step, 
it is only required to compute E(( 7 ? | Y 9 , G 9 , t*W) = Pr^ = 1 1 Y 9 , G 9 , t*W), i.e., the PIP for each 
gene-SNP pair, which we obtain from the MCMC sampling. 

To summarize, we outline the procedure of the MCMC-within-EM algorithm based on the above 
derivation as follows 

1. At t = 1, initiate a. = a' 1 ) 

2. Compute prior Pr(T 9 | D s , and run MCMC algorithm for multiple czs-eQTL analysis for 
each gene g 

3. Compute Pr(7 9 = 1 1 Y 9 , G 9 , ay)) for each gene-SNP pair from the posterior samples 

4. Find c*(* +1 ) by fitting a logistic regression model treating Pr(7 9 = 1 1 Y 9 , G 9 , a^) as response 
variable and {D 9 } as observed covariates 

5. Repeat 2 to 4 until convergence 
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Figure Legends 
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Fig. 1. An example of modest yet consistent eQTL signals across population groups. The forest plot 
shows the genetic effects of SNP rs7207370 with respect to the expression levels of gene NME1 
(ensemble ID: ENSG00000239672). SNP rs7207370 is one of the top associated ds-SNPs in all 
population groups, yet the strengths of the association signals are modest in all groups (the 
maximum single SNP Bayes factor, among five groups, is 18.0 in FIN, the corresponding 
gene-level Bayes factor is 1.8). As a consequence, the gene is not identified as an eGene in any 
of the separate analyses. Across populations, the SNP exhibit a strongly consistent association 
pattern. In the cross-population meta-analysis, the gene-level Bayes factor reaches 1.1 x 10 4 
(single SNP Bayes factor for rs7207370 is 9.5 x 10 5 ). 
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Fig. 2. Histogram of posterior expected number of cis-eQTLs in 6,555 identified eGenes. The figure 
indicates that for most of the eGenes, we identified only single cis-eQTLs. However, for a 
non-trivial proportion of eGenes, multiple independent eQTL signals were identified. 
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Fig. 3. An example of a gene harboring four independent cis-eQTL signals. The top left panel plots 
the ds-SNPs with PIP > 0.02. The locations of the SNPs are with repect to the TSS of gene 
LHPP. The ticks on the x-axis indicate all interrogated cis-SNPs in the region. The SNPs with 
the same color are in high LD and represent the same eQTL signal. In the plot, the sums of the 
PIPs from the SNPs in the same colors are all ~ 1, indicating that we are confident of the 
existence of each signal. The heatmaps show the LD patterns in each of the population group. 
They are qualitatively similar, except that the SNPs representing the first signals are 
monomorphic in GBR. In the bottom panel, we plot the effect sizes of eQTLs jointly estimated 
from one of the high posterior probability models. Each of the SNP plotted belongs to a 
different colored cluster in the PIP plot (as indicated by the color coding of the error bars). 
The effect sizes and standard errors are estimated from the multiple linear regression models 
(containing all four SNPs) separately fitted in each population group. All the signals show 
strong consistency across populations. 
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Fig. 4. An example of automatic LD filtering across population groups. The top panel shows the 

result of multiple cis-eQTL analysis for gene AG03 using only the data from TSI. The SNPs 
with PIPs > 0.02 are plotted. All SNPs plotted are in high LD in TSI, and the sum of the 
PIPs across the genomic region is close to 1. The region spanned by the signals is ~ 140 kb. 
We repeated the analysis jointly across all five populations, the SNPs with PIPs > 0.02 arc 
plotted in the middle panel. The genomic region harboring the eQTL is narrowed down into a 
1.2 kb region enclosed by three SNPs each with PIP ~ 0.33. The bottom panel shows the LD 
heatmaps between the 41 SNPs plotted in the top panel in TSI, GBR and YPJ, respectively. 
The multiple cis-eQTL mapping method takes advantage of the varying LD patterns across 
populations, and automatically narrows down the region harboring the true causal cis-eQTL. 
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Fig. 5. Multiple SNP analysis helps explain observed strong heterogeneity in single SNP analysis. 
SNP rs6006800 and the SNPs in LD (whose genomic locations are indicated by the purple 
triangels in the top panel) in the cis-region of gene TTC38 shows distinct opposite effects in 
European and YRI populations when analyzed alone. The middle left panel shows the effect 
sizes of rs6008600 estimated in the single SNP analysis in the five populations. The top panel 
shows the multiple czs-eQTL analysis result. SNPs with PIPs > 0.02 are plotted. The result 
suggests that there are two independent signals in the region, with one represented by SNPs 
colored in green, and the other represented by SNPs colored in brown. The sums of the green 
and brown SNPs are both very close to 1. SNP rs6008600 and the SNPs in LD all have PIPs 
~ 0 in the multiple cis-eQTL analysis. The middle right panel shows the effect sizes of 
rs6006800 estimated from the multiple linear regression models controlling for the two 
independent signals in each population: the genetic association observed in the single SNP 
analysis is seemingly "explained away" by the two independent signals identified by the fine 
mapping analysis. The bottom panel shows LD heatmaps between SNPs highlighted in the top 
panel (green, brown SNPs and SNPs in LD with rs6006800) in the five populations. Some of 
the green SNPs are monomorphic in YRI. The opposite effects of rs6006800 is clearly explained 
by the varying LD patterns: rs6006800 is in high LD with the brown SNPs in YRI, whereas in 
European populations it tags the green SNPs. 
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(a) Distribution of posterior expected number of cis-eQTLs with respect to distance to TSS 
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Fig. 6. Enrichment of cis-cQTLs with repect to distance to TSS. 
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Fig. 7. Comparison of fine mapping results of gene LY86 before and after incorporating functional 

annotations. The top panel is based on the analysis using uniform prior ([5]). The bottom panel 
is based on the analysis using prior Q incorporating SNP distances to TSS and the annotations 
for binding variants and footprint SNPs. In both panels, SNPs with PIPs > 0.02 are plotted. 
There are clearly three independent cis-eQTLs in the region, represented by different colors of 
SNPs. SNPs in smae colors are in high LD. The sums of PIPs from SNPs in same colors are all 
close to 1. The circled points are predicted binding variants. It is clear that binding variants 
are up-weighted when annotation information is incorporated into the fine mapping analysis. 
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Population 


Sample Size 


Number of eGenes 


YRJ 


77 


1042 


CEU 


78 


960 


TSI 


92 


1803 


GBR 


84 


2078 


FIN 


89 


2100 


META 




6555 



Tab. 1. Comparison of eGene discovery by separate vs. meta analysis. eGenes were declared by 

rejecting the null hypothesis of no cis-eQTLs in the cis region at 5% FDR level. The difference 
of identified eGenes between CEU, YRI and the rest of the European populations in the 
separate analysis was likely due to the cell line effects (T. Lappalainen, personal 
communication). Clearly, the meta-analysis improved power of eGene discovery by aggregating 
samples across population groups. 



